box size
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Zhou, Yuqi, Dai, Sunhao, Wang, Shuai, Zhou, Kaiwen, Jia, Qinglin, Xu, Jun
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update-each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
A Visual-Analytical Approach for Automatic Detection of Cyclonic Events in Satellite Observations
Agrawal, Akash, Mohapatra, Mayesh, Raja, Abhinav, Tiwari, Paritosh, Pattanaik, Vishwajeet, Jaiswal, Neeru, Agarwal, Arpit, Rathore, Punit
Estimating the location and intensity of tropical cyclones holds crucial significance for predicting catastrophic weather events. In this study, we approach this task as a detection and regression challenge, specifically over the North Indian Ocean (NIO) region where best tracks location and wind speed information serve as the labels. The current process for cyclone detection and intensity estimation involves physics-based simulation studies which are time-consuming, only using image features will automate the process for significantly faster and more accurate predictions. While conventional methods typically necessitate substantial prior knowledge for training, we are exploring alternative approaches to enhance efficiency. This research aims to focus specifically on cyclone detection, intensity estimation and related aspects using only image input and data-driven approaches and will lead to faster inference time and automate the process as opposed to current NWP models being utilized at SAC. In context to algorithm development, a novel two stage detection and intensity estimation module is proposed. In the first level detection we try to localize the cyclone over an entire image as captured by INSAT3D over the NIO (North Indian Ocean). For the intensity estimation task, we propose a CNN-LSTM network, which works on the cyclone centered images, utilizing a ResNet-18 backbone, by which we are able to capture both temporal and spatial characteristics.
Adversarial Detection: Attacking Object Detection in Real Time
Wu, Han, Yunas, Syed, Rowlands, Sareh, Ruan, Wenjie, Wahlstrom, Johan
Intelligent robots rely on object detection models to perceive the environment. Following advances in deep learning security it has been revealed that object detection models are vulnerable to adversarial attacks. However, prior research primarily focuses on attacking static images or offline videos. Therefore, it is still unclear if such attacks could jeopardize real-world robotic applications in dynamic environments. This paper bridges this gap by presenting the first real-time online attack against object detection models. We devise three attacks that fabricate bounding boxes for nonexistent objects at desired locations. The attacks achieve a success rate of about 90% within about 20 iterations. The demo video is available at https://youtu.be/zJZ1aNlXsMU.
One-Shot General Object Localization
You, Yang, Miao, Zhuochen, Xiong, Kai, Wang, Weiming, Lu, Cewu
This paper presents a general one-shot object localization algorithm called OneLoc. Current one-shot object localization or detection methods either rely on a slow exhaustive feature matching process or lack the ability to generalize to novel objects. In contrast, our proposed OneLoc algorithm efficiently finds the object center and bounding box size by a special voting scheme. To keep our method scale-invariant, only unit center offset directions and relative sizes are estimated. A novel dense equalized voting module is proposed to better locate small texture-less objects. Experiments show that the proposed method achieves state-of-the-art overall performance on two datasets: OnePose dataset and LINEMOD dataset. In addition, our method can also achieve one-shot multi-instance detection and non-rigid object localization. Code repository: https://github.com/qq456cvb/OneLoc.
Sequential Community Mode Estimation
Jain, Shubham Anand, Goenka, Shreyas, Bapna, Divyam, Karamchandani, Nikhil, Nair, Jayakrishnan
We consider a population, partitioned into a set of communities, and study the problem of identifying the largest community within the population via sequential, random sampling of individuals. There are multiple sampling domains, referred to as \emph{boxes}, which also partition the population. Each box may consist of individuals of different communities, and each community may in turn be spread across multiple boxes. The learning agent can, at any time, sample (with replacement) a random individual from any chosen box; when this is done, the agent learns the community the sampled individual belongs to, and also whether or not this individual has been sampled before. The goal of the agent is to minimize the probability of mis-identifying the largest community in a \emph{fixed budget} setting, by optimizing both the sampling strategy as well as the decision rule. We propose and analyse novel algorithms for this problem, and also establish information theoretic lower bounds on the probability of error under any algorithm. In several cases of interest, the exponential decay rates of the probability of error under our algorithms are shown to be optimal up to constant factors. The proposed algorithms are further validated via simulations on real-world datasets.
Compositional Generalization in Image Captioning
Nikolaus, Mitja, Abdou, Mostafa, Lamm, Matthew, Aralikatte, Rahul, Elliott, Desmond
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. We propose a multi-task model to address the poor performance, that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
A Machine Learning Approach to Shipping Box Design
Having the right assortment of shipping boxes in the fulfillment warehouse to pack and ship customer's online orders is an indispensable and integral part of nowadays eCommerce business, as it will not only help maintain a profitable business but also create great experiences for customers. However, it is an extremely challenging operations task to strategically select the best combination of tens of box sizes from thousands of feasible ones to be responsible for hundreds of thousands of orders daily placed on millions of inventory products. In this paper, we present a machine learning approach to tackle the task by formulating the box design problem prescriptively as a generalized version of weighted k-medoids clustering problem, where the parameters are estimated through a variety of descriptive analytics. We test this machine learning approach on fulfillment data collected from Walmart U.S. eCommerce, and our approach is shown to be capable of improving the box utilization rate by more than 10%. Keywords: Shipping box design, k-medoids clustering, eCommerce, packaging science, operations research 1 Introduction The assortment of shipping boxes utilized by the fulfillment warehouse to pack and ship customer's online orders is a critical component of nowadays eCommerce business, as it will directly affect not only profit margins but also customer's experience.
Text Box Size, Skill, and Iterative Practice in a Writing Task
Raine, Roxanne Benoit (University of Memphis) | Mintz, Lisa (University of Memphis) | Crossley, Scott A. (Georgia State University) | Dai, Jianmin (University of Memphis) | McNamara, Danielle S. (University of Memphis)
Although freewriting strategies are commonly taught in composition courses, there have been few empirical studies on freewriting. We address this gap by examining effects of prior writing skills (as measured by a pre-write essay), freewriting training, text-box size (1, 10, 20 lines), and repetitive writing on freewriting quality. Participants watched an agent-based vicarious learning freewriting instruction video or a control video including brief instructions on freewriting. After training, participants wrote six freewrites, two in each box size. Lesson delivery and text box size did not affect expert human ratings of the freewrites. Furthermore, participants did not benefit from writing successive freewrites regardless of their initial skill level. We describe how these results have been used to inform the design of Writing-Pal, an essay-writing intelligent tutoring system.